The primary goal of this project is to demonstrate the process of exploring a real-world data set downloaded from Kaggle with R, and to share the results. Any discoveries are aimed at forming future hypotheses about what drives publication success across variables such as page count and publication date. Publication success is quantified here through average ratings and the number of written text reviews for each publication.

The secondary goal of this project is to demonstrate proficiency with R programming, R Markdown, and various statistical visualization methods.

After some initial cleaning of the data set in Excel, the exploration of the Goodreads data in R can begin.


Setting the working directory

setwd("~/Documents/Current_Classes/R_Programming/Internship Project")

Importing the cleaned csv file

books <- read.csv("books.csv")
head(books)
##   bookID
## 1      1
## 2      2
## 3      4
## 4      5
## 5      8
## 6      9
##                                                                                     title
## 1                               Harry Potter and the Half-Blood Prince (Harry Potter  #6)
## 2                            Harry Potter and the Order of the Phoenix (Harry Potter  #5)
## 3                              Harry Potter and the Chamber of Secrets (Harry Potter  #2)
## 4                             Harry Potter and the Prisoner of Azkaban (Harry Potter  #3)
## 5                                  Harry Potter Boxed Set  Books 1-5 (Harry Potter  #1-5)
## 6 Unauthorized Harry Potter Book Seven News: "Half-Blood Prince" Analysis and Speculation
##                       authors average_rating       isbn       isbn13
## 1 J.K. Rowling/Mary GrandPré           4.57  439785960 9.780440e+12
## 2 J.K. Rowling/Mary GrandPré           4.49  439358078 9.780439e+12
## 3                J.K. Rowling           4.42  439554896 9.780440e+12
## 4 J.K. Rowling/Mary GrandPré           4.56 043965548X 9.780440e+12
## 5 J.K. Rowling/Mary GrandPré           4.78  439682584 9.780440e+12
## 6      W. Frederick Zimmerman           3.74  976540606 9.780977e+12
##   language_code num_pages ratings_count text_reviews_count publication_date
## 1           eng       652       2095690              27591       2006-09-16
## 2           eng       870       2153167              29221       2004-09-01
## 3           eng       352          6333                244       2003-11-01
## 4           eng       435       2339585              36325       2004-05-01
## 5           eng      2690         41428                164       2004-09-13
## 6         en-US       152            19                  1       2005-04-26
##         publisher
## 1 Scholastic Inc.
## 2 Scholastic Inc.
## 3      Scholastic
## 4 Scholastic Inc.
## 5      Scholastic
## 6    Nimble Books

Deleting the ISBN and ISBN13 columns

ISBNs are identifiers publishers use to keep track of individual titles and editions. Since the data set already has a title column for each book, these two columns are redundant for this analysis.

head(books[,5:6]) #Here are the columns to be deleted
##         isbn       isbn13
## 1  439785960 9.780440e+12
## 2  439358078 9.780439e+12
## 3  439554896 9.780440e+12
## 4 043965548X 9.780440e+12
## 5  439682584 9.780440e+12
## 6  976540606 9.780977e+12
books <- books[,-5:-6]
colnames(books) #Checking to make sure the two columns have been removed
##  [1] "bookID"             "title"              "authors"           
##  [4] "average_rating"     "language_code"      "num_pages"         
##  [7] "ratings_count"      "text_reviews_count" "publication_date"  
## [10] "publisher"
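Dropping columns by numeric position works, but silently breaks if the column order ever changes. A name-based sketch (on a toy frame standing in for books, with abbreviated columns) is more robust:

```r
# Toy stand-in for the real `books` frame; column names match the data set
books_demo <- data.frame(
  title     = c("A", "B"),
  isbn      = c("439785960", "439358078"),
  isbn13    = c("9780439785969", "9780439358071"),
  num_pages = c(652, 870)
)

# Drop the two columns by name instead of by position
books_demo <- books_demo[, !(names(books_demo) %in% c("isbn", "isbn13"))]
names(books_demo)  # "title" "num_pages"
```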

Initial Look into the Data

summary(books)
##      bookID         title             authors          average_rating 
##  Min.   :    1   Length:11127       Length:11127       Min.   :0.000  
##  1st Qu.:10287   Class :character   Class :character   1st Qu.:3.770  
##  Median :20287   Mode  :character   Mode  :character   Median :3.960  
##  Mean   :21311                                         Mean   :3.934  
##  3rd Qu.:32104                                         3rd Qu.:4.135  
##  Max.   :45641                                         Max.   :5.000  
##  language_code        num_pages      ratings_count     text_reviews_count
##  Length:11127       Min.   :   0.0   Min.   :      0   Min.   :    0.0   
##  Class :character   1st Qu.: 192.0   1st Qu.:    104   1st Qu.:    9.0   
##  Mode  :character   Median : 299.0   Median :    745   Median :   46.0   
##                     Mean   : 336.4   Mean   :  17936   Mean   :  541.9   
##                     3rd Qu.: 416.0   3rd Qu.:   4994   3rd Qu.:  237.5   
##                     Max.   :6576.0   Max.   :4597666   Max.   :94265.0   
##  publication_date    publisher        
##  Length:11127       Length:11127      
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

The publication date is in a character format, preventing proper analysis on that column. This can be fixed with the stringr package.

typeof(books$publication_date)
## [1] "character"
head(books$publication_date)
## [1] "2006-09-16" "2004-09-01" "2003-11-01" "2004-05-01" "2004-09-13"
## [6] "2005-04-26"
library(stringr)

nchar(head(books$publication_date))
## [1] 10 10 10 10 10 10
str_sub(books$publication_date, 5, 10) <- ""  #This removes the month and day values
books$publication_date <- as.numeric(books$publication_date) #This makes sure the resulting values are numeric
## Warning: NAs introduced by coercion
head(books$publication_date) #Checking to make sure we no longer have character strings as date values
## [1] 2006 2004 2003 2004 2004 2005
typeof(books$publication_date) #Success!
## [1] "double"
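Overwriting the string in place works, but it throws away the month and day for good. A sketch of an alternative that parses the full date first, assuming every entry follows the same ISO "YYYY-MM-DD" layout:

```r
# Sample values in the same format as publication_date
dates <- c("2006-09-16", "2004-09-01", "2003-11-01")

# Parse to Date first, then extract the year; month and day stay recoverable
parsed <- as.Date(dates, format = "%Y-%m-%d")
years  <- as.numeric(format(parsed, "%Y"))
years  # 2006 2004 2003
```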

Renaming the dates column to something more convenient

colnames(books)
##  [1] "bookID"             "title"              "authors"           
##  [4] "average_rating"     "language_code"      "num_pages"         
##  [7] "ratings_count"      "text_reviews_count" "publication_date"  
## [10] "publisher"
newnames <- c("bookID", "title", "authors", "average_rating", "lang_code", "num_pages", "num_ratings", "num_text_reviews", "pub_date", "publisher")
colnames(books) <- newnames
colnames(books) #Checking to make sure the columns have the appropriate names
##  [1] "bookID"           "title"            "authors"          "average_rating"  
##  [5] "lang_code"        "num_pages"        "num_ratings"      "num_text_reviews"
##  [9] "pub_date"         "publisher"

Quick look at the cleaned data

summary(books)
##      bookID         title             authors          average_rating 
##  Min.   :    1   Length:11127       Length:11127       Min.   :0.000  
##  1st Qu.:10287   Class :character   Class :character   1st Qu.:3.770  
##  Median :20287   Mode  :character   Mode  :character   Median :3.960  
##  Mean   :21311                                         Mean   :3.934  
##  3rd Qu.:32104                                         3rd Qu.:4.135  
##  Max.   :45641                                         Max.   :5.000  
##                                                                       
##   lang_code           num_pages       num_ratings      num_text_reviews 
##  Length:11127       Min.   :   0.0   Min.   :      0   Min.   :    0.0  
##  Class :character   1st Qu.: 192.0   1st Qu.:    104   1st Qu.:    9.0  
##  Mode  :character   Median : 299.0   Median :    745   Median :   46.0  
##                     Mean   : 336.4   Mean   :  17936   Mean   :  541.9  
##                     3rd Qu.: 416.0   3rd Qu.:   4994   3rd Qu.:  237.5  
##                     Max.   :6576.0   Max.   :4597666   Max.   :94265.0  
##                                                                         
##     pub_date     publisher        
##  Min.   :1900   Length:11127      
##  1st Qu.:1998   Class :character  
##  Median :2003   Mode  :character  
##  Mean   :2000                     
##  3rd Qu.:2005                     
##  Max.   :2020                     
##  NA's   :1

Without removing extreme cases, here is the baseline data:

The average rating for books in this data set is 3.934
The average number of pages for books in this data set is 336 (the median is 299)
The average number of ratings for books in this data set is 17,936
The average publication year for books in this data set is 2000, with the earliest being 1900 and the most recent being 2020.
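Note that pub_date now contains one NA from the earlier coercion, so direct calls like mean() on that column need na.rm = TRUE; a sketch on toy values:

```r
pub_date <- c(1998, 2003, NA, 2005)

mean(pub_date)                # NA: the missing value propagates
mean(pub_date, na.rm = TRUE)  # 2002
```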

Initial exploration of the data with some simple R code

books[which.max(books$num_pages), ] #The book in the dataset with the most pages is a five-volume set of the complete Aubrey/Maturin novels
##      bookID                                          title         authors
## 6501  24520 The Complete Aubrey/Maturin Novels (5 Volumes) Patrick O'Brian
##      average_rating lang_code num_pages num_ratings num_text_reviews pub_date
## 6501            4.7       eng      6576        1338               81     2004
##                  publisher
## 6501 W. W. Norton  Company

Which book had the highest number of text reviews?

books[which.max(books$num_text_reviews),] #Ahh lovely... Twilight...
##       bookID                   title         authors average_rating lang_code
## 10341  41865 Twilight (Twilight  #1) Stephenie Meyer           3.59       eng
##       num_pages num_ratings num_text_reviews pub_date                 publisher
## 10341       501     4597666            94265     2006 Little  Brown and Company

Which book had the lowest rating?

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
book_rated <- books %>% filter(num_ratings>0)

book_rated[which.min(book_rated$average_rating),]
##      bookID                                    title         authors
## 3215  11854 Puzzle Pack: The Witch of Blackbird Pond Mary B. Collins
##      average_rating lang_code num_pages num_ratings num_text_reviews pub_date
## 3215              1       eng       134           2                0     2005
##                             publisher
## 3215 Teacher's Pet Publications  Inc.

Which book had the highest rating?

book_extra_rated <- books %>% filter(num_ratings>100)
book_extra_rated[which.max(book_extra_rated$average_rating),]
##      bookID                          title        authors average_rating
## 5065  24812 The Complete Calvin and Hobbes Bill Watterson           4.82
##      lang_code num_pages num_ratings num_text_reviews pub_date
## 5065       eng      1456       32213              930     2005
##                      publisher
## 5065 Andrews McMeel Publishing
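The filter-then-which.max pattern can also be written as a single dplyr pipeline; a sketch on toy rows (slice_max() requires dplyr >= 1.0):

```r
library(dplyr)

toy <- data.frame(
  title          = c("A", "B", "C"),
  average_rating = c(4.9, 4.2, 4.8),
  num_ratings    = c(50, 900, 3000)
)

best <- toy %>%
  filter(num_ratings > 100) %>%     # ignore barely-rated titles
  slice_max(average_rating, n = 1)  # keep the single best-rated row

best$title  # "C"
```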

Using the ggplot2 histogram function to look deeper into the dataset

library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
num_pages_plot <- ggplot(books, aes(num_pages)) + #The majority of books are under 500 pages in length
                    geom_histogram() +
                    labs(title="Book Lengths", x="Number of Pages per Book", y="Frequency") +
                    xlim(0,1500)
  
text_rev_plot <- ggplot(books, aes(num_text_reviews)) + #The vast majority of books have under 5000 text reviews
                    geom_histogram() +
                    labs(title="Number of Text Reviews", x="Number of Text Reviews per Book", y="Frequency") +
                    xlim(0,10000) +
                    ylim(0,600)

grid.arrange(num_pages_plot, text_rev_plot, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 29 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 94 rows containing non-finite values (stat_bin).
## Warning: Removed 3 rows containing missing values (geom_bar).

These two histograms share a strongly right-skewed shape, hinting at a possible relationship between page count and the number of text reviews for each title.

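One way to probe that hint numerically is a rank correlation, which tolerates the heavy right skew in both columns better than Pearson's r. A sketch on toy values; the real call would be cor(books$num_pages, books$num_text_reviews, method = "spearman", use = "complete.obs"):

```r
pages   <- c(100, 250, 400, 650, 900)
reviews <- c(12, 40, 95, 300, 150)

# Spearman correlates the ranks, so extreme values carry no extra weight
cor(pages, reviews, method = "spearman")  # 0.9
```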
avg_ratings_plot <- ggplot(books, aes(average_rating)) + #Most books have an average rating somewhere between 3.5 and 4.5
                      geom_histogram() +
                      labs(title="Average Ratings", x="Average Ratings Per Book", y="Frequency") +
                      xlim(2.5,5)
avg_ratings_plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 37 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

The vast majority of titles had average ratings between 3.5 and 4.5; outliers, however, are present.

Investigating relationships between the variables with ggplot2 scatter plots, using appropriate x- and y-axis limits to view the bulk of the data

The various ylim and xlim values were largely obtained using the ±1.5 × IQR rule (Tukey's fences), applying common sense when necessary.
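That rule can be wrapped in a small helper so the limits don't have to be derived by hand each time; a sketch:

```r
# Tukey's fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
# are flagged as likely outliers
iqr_bounds <- function(x) {
  q      <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  spread <- q[[2]] - q[[1]]
  c(lower = q[[1]] - 1.5 * spread,
    upper = q[[2]] + 1.5 * spread)
}

iqr_bounds(c(1998, 1999, 2003, 2005, 2005, 2020))
```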

To further explore the dataset I will create an interactive multivariate plot, with language as the grouping variable.

I have already identified a potential relationship between number of pages, number of text reviews, and average rating. Now I would like to see if there are any relationships between publishing language, publication date, number of pages, and the average_rating of books and collections in the dataset.
unique(books$lang_code)
##  [1] "eng"   "en-US" "fre"   "spa"   "en-GB" "mul"   "grc"   "enm"   "en-CA"
## [10] "ger"   "jpn"   "ara"   "nl"    "zho"   "lat"   "por"   "srp"   "ita"  
## [19] "rus"   "msa"   "glg"   "wel"   "swe"   "nor"   "tur"   "gla"   "ale"
books$lang_code <- factor(books$lang_code, ordered=TRUE, levels=c("eng", "en-US", "fre", "spa", "en-GB", "mul", "grc", "enm", "en-CA", "ger", "jpn", "ara", "nl", 
                                                                  "zho", "lat", "por", "srp", "ita", "rus", "msa", "glg", "wel", "swe", "nor", "tur", "gla", "ale"))

I only want lang_code, num_pages, average_rating, and pub_date, so I will eliminate all other columns and create a new dataframe called “mvbooks”.

mvbooks<-books[,-1:-3]
mvbooks<-mvbooks[,-7]
mvbooks<-mvbooks[,-4:-5]
colnames(mvbooks)
## [1] "average_rating" "lang_code"      "num_pages"      "pub_date"

Now the four desired variables are singled out in the “mvbooks” dataframe, however for convenience I want the grouping variable, “lang_code”, on the far left of the multivariate plot.

mvbooks2 <- mvbooks[,c(2,4,3,1)]
colnames(mvbooks2)
## [1] "lang_code"      "pub_date"       "num_pages"      "average_rating"
summary(mvbooks2)
##    lang_code       pub_date      num_pages      average_rating 
##  eng    :8911   Min.   :1900   Min.   :   0.0   Min.   :0.000  
##  en-US  :1409   1st Qu.:1998   1st Qu.: 192.0   1st Qu.:3.770  
##  spa    : 218   Median :2003   Median : 299.0   Median :3.960  
##  en-GB  : 214   Mean   :2000   Mean   : 336.4   Mean   :3.934  
##  fre    : 144   3rd Qu.:2005   3rd Qu.: 416.0   3rd Qu.:4.135  
##  ger    :  99   Max.   :2020   Max.   :6576.0   Max.   :5.000  
##  (Other): 132   NA's   :1

Now to plot the newly created “mvbooks2” dataframe.

library(cdparcoord)
## Loading required package: data.table
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## Loading required package: freqparcoord
## Loading required package: parallel
## Loading required package: GGally
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## Loading required package: FNN
## Loading required package: mvtnorm
## 
##    
## 
##    
## 
##    For a quick introduction, type ?freqparcoord, and
##    run the examples, making sure to read the comments.
##    
## 
## 
## Loading required package: partools
## Loading required package: regtools
## Loading required package: dummies
## dummies-1.5.6 provided by Decision Patterns
## Loading required package: sandwich
## 
## 
## 
## 
## 
## *********************
## 
## 
## 
## Latest version of regtools at GitHub.com/matloff
## 
## 
## Type "?regtools" for function list.
## Loading required package: pdist
## Latest version of partools at GitHub.com/matloff
## 
## 
## 
## 
## 
## *********************
## 
## 
## 
## 
## 
## 
## Type ?quickstart for cdparcoord quick start 
## 
## 
## 
## 
mm <- discretize(mvbooks2,nlevels=100) 
discparcoord(mm,k=5000,saveCounts=FALSE,name="test") 
<<<<<<< HEAD
=======
>>>>>>> parent of 9156b0d (update)

This plot is far too cluttered, and would be more useful with fewer distinct years and page counts crammed onto the axes.

summary(mvbooks2$pub_date)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1900    1998    2003    2000    2005    2020       1

I will use the 1.5 × IQR strategy to figure out which years are likely outliers. The IQR of pub_date is 2005 - 1998 = 7 years, and 1.5 × 7 = 10.5. The lower fence is 1998 - 10.5 = 1987.5, rounded down to 1987. The upper fence would be 2005 + 10.5 = 2015.5, but, applying the common-sense adjustment mentioned earlier, I tighten the upper cutoff to 2011. Let's look at books published between 1987 and 2011.

mvbooks3 <- mvbooks2 %>% filter(pub_date >= 1987, pub_date <= 2011)
summary(mvbooks3)
##    lang_code       pub_date      num_pages      average_rating 
##  eng    :8294   Min.   :1987   Min.   :   0.0   Min.   :0.000  
##  en-US  :1325   1st Qu.:1999   1st Qu.: 196.0   1st Qu.:3.770  
##  spa    : 208   Median :2003   Median : 302.0   Median :3.960  
##  en-GB  : 199   Mean   :2002   Mean   : 335.7   Mean   :3.932  
##  fre    : 132   3rd Qu.:2005   3rd Qu.: 416.0   3rd Qu.:4.130  
##  ger    :  93   Max.   :2011   Max.   :6576.0   Max.   :5.000  
##  (Other): 123

The same process will be repeated for the “num_pages” column.

summary(mvbooks3$num_pages)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   196.0   302.0   335.7   416.0  6576.0

The IQR of num_pages is 416 - 196 = 220, and 1.5 × 220 = 330. The lower fence, 196 - 330 = -134, is negative, so no lower cutoff is needed. The upper fence is 416 + 330 = 746, so let's narrow the books down to those with at most 746 pages.

summary(mvbooks3$num_pages)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   196.0   302.0   335.7   416.0  6576.0
mvbooks3 <- mvbooks3 %>% filter(num_pages <= 746)
summary(mvbooks3)
##    lang_code       pub_date      num_pages     average_rating 
##  eng    :7875   Min.   :1987   Min.   :  0.0   Min.   :0.000  
##  en-US  :1262   1st Qu.:1999   1st Qu.:192.0   1st Qu.:3.770  
##  spa    : 199   Median :2003   Median :288.0   Median :3.950  
##  en-GB  : 193   Mean   :2002   Mean   :300.5   Mean   :3.921  
##  fre    : 125   3rd Qu.:2005   3rd Qu.:392.0   3rd Qu.:4.120  
##  ger    :  86   Max.   :2011   Max.   :746.0   Max.   :5.000  
##  (Other): 119
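The two trimming steps can also be combined into one dplyr pipeline; a sketch on toy rows:

```r
library(dplyr)

toy <- data.frame(
  pub_date  = c(1950, 1990, 2005, 2015),
  num_pages = c(300, 200, 900, 400)
)

trimmed <- toy %>%
  filter(pub_date >= 1987, pub_date <= 2011,  # drop likely-outlier years
         num_pages <= 746)                    # drop likely-outlier lengths

nrow(trimmed)  # 1
```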

With the newly created “mvbooks3” dataframe, the multivariate visualization should be less cluttered.

mm <- discretize(mvbooks3, nlevels=100)
discparcoord(mm, k=5000, saveCounts=FALSE, name="test")

While the plot is still fairly cluttered, it yielded an interesting finding: the data set contains at least one book, published in English in 1989, with three or fewer pages yet an average GoodReads rating of 4 or greater. Very odd.

Verifying the existence of the strange findings with an SQL query

library(sqldf)

sqldf("select * from books where pub_date = 1989 and average_rating >= 4 and num_pages <= 3")
##   bookID                                   title
## 1   5545 The Feynman Lectures on Physics  3 Vols
## 2  21931                 The Day Before Midnight
##                                                  authors average_rating
## 1 Richard P. Feynman/Robert B. Leighton/Matthew L. Sands           4.60
## 2                            Stephen Hunter/Philip Bosco           4.01
##   lang_code num_pages num_ratings num_text_reviews pub_date
## 1     en-US         3          78                7     1989
## 2       eng         0           1                0     1989
##                           publisher
## 1 Addison Wesley Publishing Company
## 2                Random House Audio
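The same check can be expressed without SQL using dplyr, which avoids the extra dependency; a sketch on toy rows:

```r
library(dplyr)

toy <- data.frame(
  title          = c("X", "Y", "Z"),
  pub_date       = c(1989, 1989, 2001),
  average_rating = c(4.6, 3.0, 4.8),
  num_pages      = c(3, 2, 1)
)

suspicious <- toy %>%
  filter(pub_date == 1989, average_rating >= 4, num_pages <= 3)

suspicious$title  # "X"
```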

There are two books that meet this strange criterion; however, double-checking against the current GoodReads website reveals that both were entered with inaccurate values. The Day Before Midnight actually has 434 pages according to Google Books, and The Feynman Lectures on Physics has 384 pages according to GoodReads at the time of writing. A careful reading of the visualization, followed by verification, led to the discovery of an important error in the data set. There are likely other such errors that went uncaught as well.

Just for fun, let’s make a random book generator that uses titles from the GoodReads dataset we are working with

gimmeABook <- function() {
  rando <- sample(nrow(books), 1) #Pick a random row index; nrow() stays correct if rows are added or removed
  books[rando, ]
}

gimmeABook()
##      bookID                                                  title      authors
## 6914  26054 The MacGregors: Serena & Caine (The MacGregors  #1 -2) Nora Roberts
##      average_rating lang_code num_pages num_ratings num_text_reviews pub_date
## 6914           4.09       eng       441       11070              150     2006
##      publisher
## 6914      Mira
gimmeABook()
##      bookID      title       authors average_rating lang_code num_pages
## 3598  13040 Siddhartha Hermann Hesse           4.02       eng       132
##      num_ratings num_text_reviews pub_date        publisher
## 3598         514               40     1993 Turtleback Books
gimmeABook()
##     bookID                        title      authors average_rating lang_code
## 826   2731 Advanced Global Illumination Philip Dutre            4.5       eng
##     num_pages num_ratings num_text_reviews pub_date  publisher
## 826       366          17                2     2006 A K PETERS
gimmeABook()
##      bookID                               title        authors average_rating
## 2327   8506 Thomas Jefferson (Oxford Portraits) R.B. Bernstein           4.01
##      lang_code num_pages num_ratings num_text_reviews pub_date
## 2327       eng       253        4245              133     2005
##                         publisher
## 2327 Oxford University Press  USA

Through this exploration of the GoodReads data set in R, I have identified potential relationships between book length and average rating, as well as between book length and the number of text reviews. Books and collections with over 500 pages tend to have average ratings that cluster more reliably between 3.5 and 4.5 than those with fewer than 500 pages, and they also tend to have fewer total text reviews. This suggests that while fewer people are likely reading these longer books and collections, those who do are having more reliably positive experiences.

I have also identified two important pieces of information regarding the data set. One, the data includes both individual books and collections of books. This was verified by an SQL query, which revealed that the vast majority of titles in the data set with more than 1,200 pages are collections of books rather than standalone publications. Two, the creator of this data set likely made many data-entry mistakes. This was first noticed while exploring the multivariate plot created with the cdparcoord package, and then verified with SQL queries: two items in the data set were found to carry incorrect page counts. Because of this, an unknown number of titles could have false page-count information, which could only be remedied by cross-checking each entry against outside sources.

Thank you for reading,

–Greg
